Visualizing & Summarizing Numerical Data

STAT 313

Data Visualizations with ggplot2

What are the aesthetics in this plot?

What geometric object is being plotted?

Univariate (One Variable) Visualizations – For Numerical Data

  • Histogram
  • Boxplot
  • Density Plot

Histogram

library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

ggplot(data = penguins, mapping = aes(x = bill_length_mm)) +
  geom_histogram() +
  labs(x = "Bill Length (mm)")

Is this aesthetic global or local?

Pros

  • Easy to inspect
  • Higher bars represent where data are relatively more common
  • Inspect shape of a distribution (skewed or symmetric)
  • Identify modes

Cons

  • Do not plot the raw data; they plot summaries (counts per bin) of the data
  • Sensitive to the choice of bin width
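
The sensitivity to bin width can be seen by drawing the same variable with two different binwidth values. This is a minimal sketch, assuming the penguins data come from the palmerpenguins package; the widths of 0.5 mm and 5 mm are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# Narrow bins: more detail, but noisier bar heights
p_narrow <- ggplot(data = penguins, mapping = aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 0.5, na.rm = TRUE) +
  labs(x = "Bill Length (mm)", title = "binwidth = 0.5")

# Wide bins: smoother outline, but separate modes can be merged
p_wide <- ggplot(data = penguins, mapping = aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 5, na.rm = TRUE) +
  labs(x = "Bill Length (mm)", title = "binwidth = 5")

p_narrow
p_wide
```

Both plots summarize exactly the same observations; only the grouping into bins differs.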

Boxplot

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm)) +
  geom_boxplot() + 
  labs(x = "Bill Length (mm)")

  • What calculations are necessary to create a boxplot?

  • What are strengths of a boxplot?

  • What are weaknesses of a boxplot?
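
As a starting point for the first question, the quantities a boxplot displays can be computed by hand. This is a minimal sketch on a small made-up vector x (hypothetical values, not the penguins data), using the common 1.5 × IQR rule for the whiskers.

```r
# Hypothetical data with one unusually large value
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 30)

# Quartiles, using R's default quantile method
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(x)

# Fences: observations beyond these are drawn as individual points
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])
outliers <- x[x < lower_fence | x > upper_fence]

q
whiskers
outliers
```

For this vector the box runs from 4 to 10 with the median at 7, the whiskers reach 2 and 12, and 30 is flagged as an outlier.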

Density Plot

ggplot(data = penguins,
       mapping = aes(x = bill_length_mm)) +
  geom_density() +
  labs(x = "Bill Length (mm)")

  • A smooth approximation to a variable’s distribution
  • Plots density on the y-axis, scaled so the total area under the curve equals 1
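
The y-axis of a density plot is neither a count nor a proportion of observations; the curve is scaled so the area beneath it is about 1. The sketch below checks this numerically with base R's density() function on a simulated sample (hypothetical values, not the penguins data).

```r
# Hypothetical sample standing in for a numerical variable
set.seed(313)
x <- rnorm(200, mean = 44, sd = 5)

# Kernel density estimate evaluated on an evenly spaced grid
d <- density(x)

# Approximate the area under the curve with a Riemann sum
grid_step <- d$x[2] - d$x[1]
area <- sum(d$y) * grid_step
area
```

The result should be very close to 1, whatever the units of the variable.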

Bivariate (Two Variables) Visualizations – For Numerical Data

  • Scatterplots

  • Faceted Histograms

  • Side-by-Side Boxplots

  • Stacked Density Plots (Ridge Plots)
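
The last three plot types pair a numerical variable with a categorical one. The sketch below shows one possible version of each, assuming the penguins data come from the palmerpenguins package and that the ggridges package is available for the ridge plot; the choice of species as the grouping variable and the 1 mm bin width are for illustration only.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data
library(ggridges)        # assumption: one package that draws ridge plots

# Keep only rows with a recorded bill length
penguins_complete <- subset(penguins, !is.na(bill_length_mm))

# Faceted histograms: one histogram of bill length per species
p_hist <- ggplot(data = penguins_complete,
                 mapping = aes(x = bill_length_mm)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~ species, ncol = 1) +
  labs(x = "Bill Length (mm)")

# Side-by-side boxplots: one boxplot per species
p_box <- ggplot(data = penguins_complete,
                mapping = aes(x = bill_length_mm, y = species)) +
  geom_boxplot() +
  labs(x = "Bill Length (mm)", y = "Penguin Species")

# Ridge plot: one density curve per species, stacked vertically
p_ridge <- ggplot(data = penguins_complete,
                  mapping = aes(x = bill_length_mm, y = species)) +
  geom_density_ridges() +
  labs(x = "Bill Length (mm)", y = "Penguin Species")

p_hist
p_box
p_ridge
```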

Scatterplots

ggplot(data = penguins,
       mapping = aes(y = bill_length_mm, x = bill_depth_mm)) +
  geom_point() +
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)")

Multivariate Plots

There are two main methods for adding a third (or fourth) variable into a data visualization:

Colors

  • Creates a distinct color for every level of a categorical variable
  • Creates a color gradient across the values of a quantitative variable

Facets

  • Creates a subplot for every level of a variable
  • Labels each subplot with the value of the variable

Colors in Scatterplots – Categorical Variable

ggplot(data = penguins,
       mapping = aes(y = bill_length_mm,
                     x = bill_depth_mm,
                     color = species)
       ) +
  geom_point() +
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)", 
       color = "Penguin Species")

Colors in Scatterplots – Numerical Variable

ggplot(data = penguins,
       mapping = aes(y = bill_length_mm,
                     x = bill_depth_mm,
                     color = body_mass_g)
       ) +
  geom_point() +
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)", 
       color = "Body Mass (g)")

Facets in Scatterplots – Categorical Variable

ggplot(data = penguins,
       mapping = aes(y = bill_length_mm,
                     x = bill_depth_mm)) +
  geom_point() +
  facet_wrap(~ species) + 
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)")

Facets in Scatterplots – Numerical Variable

ggplot(data = penguins,
       mapping = aes(y = bill_length_mm,
                     x = bill_depth_mm)) +
  geom_point() +
  facet_wrap(~ body_mass_g) + 
  labs(x = "Bill Depth (mm)", 
       y = "Bill Length (mm)")
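
Faceting directly on a numerical variable creates one subplot for every distinct value, which is rarely readable. One possible workaround, sketched here, is to bin the variable first. The three groups, the new mass_group column, and the use of cut_number() (a ggplot2 helper that forms groups with roughly equal counts) are choices made for illustration and are not part of the example above.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: source of the penguins data

# Drop rows with missing body mass, then bin mass into three groups
penguins_binned <- subset(penguins, !is.na(body_mass_g))
penguins_binned$mass_group <- cut_number(penguins_binned$body_mass_g, n = 3)

p_binned <- ggplot(data = penguins_binned,
                   mapping = aes(y = bill_length_mm, x = bill_depth_mm)) +
  geom_point() +
  facet_wrap(~ mass_group) +
  labs(x = "Bill Depth (mm)",
       y = "Bill Length (mm)")

p_binned
```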

Summarizing Numerical Data

Measures of Center

Not Resistant

Mean

Resistant

Median


Measures of Spread

Not Resistant

Variance

Range

Resistant

Interquartile Range (IQR)
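
To see what "resistant" means in practice, the sketch below computes each summary on a small made-up sample and again after one value is replaced by an extreme one. The vectors x and x_out and the helper summarize_sample() are hypothetical, introduced only for this illustration.

```r
# Hypothetical sample, and the same sample with one extreme value
x     <- c(3, 4, 5, 6, 7)
x_out <- c(3, 4, 5, 6, 70)

# Collect the measures of center and spread for one vector
summarize_sample <- function(v) {
  c(mean     = mean(v),
    median   = median(v),
    variance = var(v),
    range    = diff(range(v)),
    IQR      = IQR(v))
}

# Compare the summaries side by side
comparison <- rbind(original     = summarize_sample(x),
                    with_outlier = summarize_sample(x_out))
comparison
```

The mean moves from 5 to 17.6 and the variance and range grow sharply, while the median (5) and the IQR (2) do not change.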

Given this distribution…

What measure of center would you use? Why?

For right skewed data…

For symmetric (and bimodal) data…

Point Estimates & Parameters


Parameter: the true value of a numerical summary (such as the mean) for the population of interest


Point Estimate: a value computed from a sample that provides our best guess for the value of the parameter


Estimates based on larger samples tend to be more accurate than those based on smaller samples.
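
A small simulation can illustrate this claim. The sketch below builds a hypothetical population with a known mean, repeatedly draws samples of two sizes, and compares how much the sample means vary. The population values, the sample sizes (10 and 100), and the helper sample_means() are all assumptions made for the illustration.

```r
set.seed(313)

# Hypothetical population with a known mean (the parameter)
population <- rnorm(100000, mean = 44, sd = 5)

# Draw many samples of size n and record each sample mean (the point estimate)
sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(population, size = n)))
}

means_small <- sample_means(n = 10)
means_large <- sample_means(n = 100)

# Spread of the point estimates around the parameter
sd(means_small)
sd(means_large)
```

The estimates from samples of size 100 cluster more tightly around the population mean than those from samples of size 10.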